More Data Stuff

Introduction

Getting the Data

To predict fantasy scores for NBA players in the 2021-2022 season, I pulled data from Basketball Reference using nbastatR, grabbing the last two decades of player season totals in both normal and advanced stats. Conveniently, this takes just one line of code!

df <- nbastatR::bref_players_stats(seasons = 2000:2021, tables = c("totals", "advanced"), widen = TRUE, assign_to_environment = FALSE)

The only initial data manipulation is creating the response variable for the dataset: fantasy points. In my league this is a very simple formula; the coefficients are arbitrarily chosen and lead to some interesting weighting in our modelling later on.

library(dplyr)
library(tidyr)
library(gt)

coefs <- tibble::tibble(Points = 1,
                        Rebounds = 1.2,
                        Assists = 1.5,
                        Steals = 3,
                        Blocks = 3,
                        Turnovers = -1,
                        `3 Point FGM` = 1) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Multiplier") %>%
  gt::gt() %>%
  data_color(
    columns = c(Multiplier),
    colors = scales::col_numeric(
      palette = paletteer::paletteer_d(
        palette = "rcartocolor::Mint"
      ) %>% as.character(),
      domain = NULL
    )
  ) %>%
  DGThemes::gt_theme_duncan()
coefs %>%
  tab_header(title = "Multipliers for Points in my Fantasy League")
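For reference, the response itself is just those multipliers applied to the raw season totals. A minimal sketch, with the multiplication written out as a hypothetical helper (the commented application assumes nbastatR's Basketball Reference column names used later in the post):

```r
# Fantasy points as a linear combination of season totals, using the
# league multipliers from the table above
fantasy_points_total <- function(pts, trb, ast, stl, blk, tov, fg3m) {
  pts * 1 + trb * 1.2 + ast * 1.5 + stl * 3 + blk * 3 + tov * -1 + fg3m * 1
}

# applied to the scraped data (column names assumed from nbastatR):
# df$fantasy_points <- with(df, fantasy_points_total(
#   ptsTotals, trbTotals, astTotals, stlTotals, blkTotals, tovTotals, fg3mTotals))
```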

Setting up the Database

The data is tabular and easy to key with an id, which I created by pasting together the player's Basketball Reference slug and the season (the slug's trailing 01 ensures players with the same name don't get merged). The format is shown below with the top 10 fantasy seasons ever in my league.

id              fantasy_points
westbru01_2017          5105.8
hardeja01_2019          4957.6
hardeja01_2017          4782.3
onealsh01_2000          4688.1
garneke01_2004          4660.3
jamesle01_2018          4585.3
jamesle01_2009          4501.1
garneke01_2003          4464.9
westbru01_2018          4439.8
bryanko01_2003          4439.3
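Constructing that id is a one-line paste; a hypothetical helper to illustrate, since the slug already carries the disambiguating suffix:

```r
# Builds the table key from the Basketball Reference player slug and season;
# the slug's 01/02 suffix disambiguates players who share a name
make_id <- function(slug, season) paste(slug, season, sep = "_")

make_id("westbru01", 2017)
# "westbru01_2017"
```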

This data lends itself quite easily to a SQLite relational database, which I set up with RSQLite. First I save the data as a CSV in the data folder, then I establish a local connection and write the table to the database file. Here I set overwrite = TRUE because I have written to the database before and am just saving it again; it is generally unnecessary.

readr::write_csv(df, "Data/player_season_totals.csv")

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "nba.db")

upload_data <- readr::read_csv("Data/player_season_totals.csv")

RSQLite::dbWriteTable(conn = con, 
                      name = "player_seasons", 
                      value = upload_data, 
                      overwrite = TRUE)
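Reading the data back out is then just SQL. A sketch against an in-memory database seeded with two rows from the leaderboard above (the same query works on nba.db through con):

```r
library(DBI)
library(RSQLite)

# toy copy of the player_seasons table for illustration
mem <- dbConnect(SQLite(), ":memory:")
dbWriteTable(mem, "player_seasons",
             data.frame(id = c("hardeja01_2019", "westbru01_2017"),
                        fantasy_points = c(4957.6, 5105.8)))

# pull the highest-scoring fantasy seasons straight from SQLite
top_seasons <- dbGetQuery(
  mem,
  "SELECT id, fantasy_points
   FROM player_seasons
   ORDER BY fantasy_points DESC
   LIMIT 10"
)
dbDisconnect(mem)
```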

Regular Updates

With the 2021-2022 season underway, I add new data to the table regularly by running the following script on the server that hosts the database.

new_df <- nbastatR::bref_players_stats(2022, tables = c("totals", "advanced"), widen = TRUE, assign_to_environment = FALSE)
new_df %>%
  janitor::clean_names() %>%
  readr::write_csv("Data/latest_season.csv")
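To fold a refresh into the database itself rather than a CSV, dbWriteTable() with append = TRUE adds the new rows while keeping the historical seasons. A sketch with toy rows (in practice the connection points at nba.db and value is the cleaned new_df):

```r
library(DBI)
library(RSQLite)

con2 <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con2, "player_seasons",
             data.frame(id = "jamesle01_2018", fantasy_points = 4585.3))

# append = TRUE keeps existing rows instead of replacing the table
dbWriteTable(con2, "player_seasons",
             data.frame(id = "morantja01_2022", fantasy_points = 0),
             append = TRUE)

n_rows <- dbGetQuery(con2, "SELECT COUNT(*) AS n FROM player_seasons")$n
dbDisconnect(con2)
```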

I made this table for fun using the wonderful gtExtras package by Tom Mock and just want to include it in the post.

top5_fg <- new_df %>%
  group_by(slug_position) %>%
  arrange(desc(pct_fg)) %>%
  slice_head(n = 5) %>%
  dplyr::filter(slug_position %in% c("C", "SF", "PF", "PG", "SG")) %>%
  transmute(Position = slug_position,
            url_player_headshot,
            Player = name_player,
            `% FG` = pct_fg, 
            `% EFG` = pct_efg, 
            `% TS` = pct_true_shooting) %>%
  gt::gt() %>%
  fmt_percent(
    columns = c(`% FG`, `% EFG`, `% TS`)
  ) %>%
  gtExtras::gt_color_rows(c(`% FG`, `% EFG`, `% TS`), palette = "rcartocolor::Mint") %>%
  text_transform(
    locations = cells_body(columns = url_player_headshot),
    fn = function(x) {
      web_image(
        url = x,
        height = as.numeric(50)
      )
    }
  ) %>%
  cols_label(
    url_player_headshot = " "
  ) %>%
  tab_header("Top 5 Field Goal % in Each Position in the 2021-2022 Season") %>%
  DGThemes::gt_theme_duncan() %>%
  suppressWarnings()

top5_fg
Top 5 Field Goal % in Each Position in the 2021-2022 Season

Position  Player                   % FG     % EFG    % TS
C         Jarrett Allen            94.10%   94.10%   93.80%
C         Daniel Gafford           87.50%   87.50%   85.80%
C         Dewayne Dedmon           83.30%   91.70%   94.50%
C         Gorgui Dieng             80.00%   90.00%   90.00%
C         Richaun Holmes           80.00%   83.30%   82.70%
PF        Nemanja Bjelica          75.00%   79.20%   81.50%
PF        P.J. Tucker              75.00%   1.00%    1.00%
PF        Domantas Sabonis         71.90%   81.30%   81.80%
PF        Aaron Gordon             66.70%   70.80%   72.70%
PF        Jaden McDaniels          66.70%   66.70%   66.70%
PG        Jrue Holiday             71.40%   85.70%   85.70%
PG        Patty Mills              68.80%   1.00%    1.00%
PG        Trey Burke               66.70%   83.30%   83.30%
PG        Tre Jones                66.70%   66.70%   66.70%
PG        Ja Morant                58.60%   60.30%   61.90%
SF        Thanasis Antetokounmpo   80.00%   80.00%   83.30%
SF        Kenyon Martin Jr.        75.00%   75.00%   71.70%
SF        Deni Avdija              66.70%   75.00%   72.70%
SF        Dalano Banton            66.70%   75.00%   75.00%
SF        Andre Iguodala           66.70%   77.80%   81.00%
SG        Anfernee Simons          83.30%   91.70%   91.70%
SG        Seth Curry               76.50%   94.10%   92.30%
SG        John Konchar             66.70%   66.70%   66.70%
SG        Eric Bledsoe             62.50%   65.60%   63.50%
SG        Zach LaVine              61.10%   70.80%   76.70%
# top5_fg %>%
#   gtsave(here::here("images/top5.png"))

Analysis

Finally, some machine learning. First we filter down to the variables relevant for our model predicting fantasy points.

normal_df <- df %>%
  transmute(count_games = countGames,
            count_games_started = countGamesStarted,
            pct_fg = pctFG,
            pct_fg3 = pctFG3,
            pct_fg2 = pctFG2,
            pct_efg = pctEFG,
            pct_ft = pctFT,
            minutes = minutesTotals,
            fgm_total = fgmTotals,
            fga_total = fgaTotals,
            fg3m_totals = fg3mTotals,
            fg3a_totals = fg3aTotals,
            fg2m_totals = fg2mTotals,
            fg2a_totals = fg2aTotals,
            ftm_totals = ftmTotals,
            fta_totals = ftaTotals,
            orb_totals = orbTotals,
            drb_totals = drbTotals,
            trb_totals = trbTotals,
            ast_totals = astTotals,
            stl_totals = stlTotals,
            blk_totals = blkTotals,
            tov_totals = tovTotals,
            pf_totals = pfTotals,
            pts_totals = ptsTotals,
            fantasy_points = fantasy_points)

Data Exploration

Before modelling, let's plot every statistic against fantasy points to get a feel for which relationships look strongest.

normal_df %>%
  pivot_longer(count_games:pts_totals, names_to = "stat", values_to = "value") %>%
  ggplot(aes(value, fantasy_points, color = stat)) +
  geom_point(alpha = 0.6) +
  ggtitle("Fantasy Points by Statistic each Season from the 1999-2000 to 2020-2021 NBA Season") +
  labs(y = "Fantasy Points", x = "") +
  paletteer::scale_color_paletteer_d(palette = "dichromat::SteppedSequential_5") +
  facet_wrap( ~ stat, ncol = 5, scales = "free_x") +
  theme(legend.position = "none")

ggsave(here::here("images/eda_points_stat.png"), width = 10, height = 10)

Building a Model

We start by loading the tidymodels metapackage, and splitting the data into training and testing sets.

# Later: maybe stratify the split by three-point era?
library(tidymodels)

set.seed(2021)
nba_split <- initial_split(normal_df)
nba_train <- training(nba_split)
nba_test <- testing(nba_split)

An XGBoost model is based on trees, so we don't need any more preprocessing than has already been done on the data; we can go straight into model specification and hyperparameter tuning.

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(), 
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = tune(), mtry = tune(),         ## randomness
  learn_rate = tune()                          ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_spec
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Computational engine: xgboost

That's a lot of parameters to tune, so let's create a space-filling design to cover the hyperparameter space as efficiently as possible.

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), nba_train),
  learn_rate(),
  size = 30
)

xgb_grid
## # A tibble: 30 × 6
##    tree_depth min_n loss_reduction sample_size  mtry learn_rate
##         <int> <int>          <dbl>       <dbl> <int>      <dbl>
##  1         15    18       3.42e- 8       0.548     9   1.03e- 4
##  2          3     4       6.00e- 4       0.958     7   8.72e- 8
##  3          2    31       1.87e- 8       0.344     6   4.87e-10
##  4         11    27       2.00e- 6       0.857     4   8.89e- 4
##  5          2    33       3.96e- 3       0.924     2   2.24e- 6
##  6          6    15       3.71e- 5       0.804     8   2.13e- 7
##  7          9     5       4.59e- 7       0.377    24   1.39e- 8
##  8         12     6       3.54e- 2       0.270    14   7.59e- 2
##  9         12    12       8.89e-10       0.763    18   1.92e-10
## 10          4     9       1.79e- 1       0.121    18   3.51e- 3
## # … with 20 more rows

Note that mtry() had to be treated differently because it depends on the number of predictors in the data.
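To see what finalize() actually does, here is a sketch with a toy five-column data frame: mtry() starts with an unknown upper bound, and finalize() pins it to the number of columns in the supplied data.

```r
library(dials)

# mtry()'s upper range is unknown until it sees the data
mtry_final <- finalize(mtry(), data.frame(a = 1, b = 2, c = 3, d = 4, e = 5))
mtry_final  # range is now 1 to 5
```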

The model specification goes into a workflow for convenience. Since there's no preprocessing necessary here, we can use add_formula() directly as our data preprocessor.

xgb_wf <- workflow() %>%
  add_formula(fantasy_points ~ .) %>%
  add_model(xgb_spec)

xgb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## fantasy_points ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Computational engine: xgboost

Next we create the cross-validation resamples for tuning the model.

set.seed(2021)
nba_folds <- vfold_cv(nba_train)

nba_folds
## #  10-fold cross-validation 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [7007/779]> Fold01
##  2 <split [7007/779]> Fold02
##  3 <split [7007/779]> Fold03
##  4 <split [7007/779]> Fold04
##  5 <split [7007/779]> Fold05
##  6 <split [7007/779]> Fold06
##  7 <split [7008/778]> Fold07
##  8 <split [7008/778]> Fold08
##  9 <split [7008/778]> Fold09
## 10 <split [7008/778]> Fold10

Now we get ready for the big task: tuning the grid with tune_grid() using the workflow, resamples, and grid of parameters to try. We use control_grid(save_pred = TRUE) so we can explore the predictions afterwards.

doParallel::registerDoParallel()

set.seed(2021)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = nba_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
xgb_res
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 × 5
##    splits             id     .metrics           .notes           .predictions   
##    <list>             <chr>  <list>             <list>           <list>         
##  1 <split [7007/779]> Fold01 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
##  2 <split [7007/779]> Fold02 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
##  3 <split [7007/779]> Fold03 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
##  4 <split [7007/779]> Fold04 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
##  5 <split [7007/779]> Fold05 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
##  6 <split [7007/779]> Fold06 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
##  7 <split [7008/778]> Fold07 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
##  8 <split [7008/778]> Fold08 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
##  9 <split [7008/778]> Fold09 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
## 10 <split [7008/778]> Fold10 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…

Exploring Results

It’s easy to explore all metrics for the models with collect_metrics()!

collect_metrics(xgb_res)
## # A tibble: 60 × 12
##     mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##    <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
##  1     9    18         15   1.03e- 4   0.0000000342       0.548 rmse   
##  2     9    18         15   1.03e- 4   0.0000000342       0.548 rsq    
##  3     7     4          3   8.72e- 8   0.000600           0.958 rmse   
##  4     7     4          3   8.72e- 8   0.000600           0.958 rsq    
##  5     6    31          2   4.87e-10   0.0000000187       0.344 rmse   
##  6     6    31          2   4.87e-10   0.0000000187       0.344 rsq    
##  7     4    27         11   8.89e- 4   0.00000200         0.857 rmse   
##  8     4    27         11   8.89e- 4   0.00000200         0.857 rsq    
##  9     2    33          2   2.24e- 6   0.00396            0.924 rmse   
## 10     2    33          2   2.24e- 6   0.00396            0.924 rsq    
## # … with 50 more rows, and 5 more variables: .estimator <chr>, mean <dbl>,
## #   n <int>, std_err <dbl>, .config <chr>

We can visualize these easily as well to understand the results.

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "R Squared")

And we can let tidymodels show the best performing sets of parameters with show_best(); it's pretty convenient that the best RMSE model is also the best R-squared one.

show_best(xgb_res, "rmse");show_best(xgb_res, "rsq")
## # A tibble: 5 × 12
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
## 1    23    10          4    0.0218     0.0142            0.322 rmse   
## 2    14     6         12    0.0759     0.0354            0.270 rmse   
## 3    11    35         13    0.0488    12.8               0.244 rmse   
## 4    21    20          5    0.00920    0.000000213       0.997 rmse   
## 5    18     9          4    0.00351    0.179             0.121 rmse   
## # … with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
## #   std_err <dbl>, .config <chr>
## # A tibble: 5 × 12
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
## 1    23    10          4    0.0218        1.42e- 2       0.322 rsq    
## 2    14     6         12    0.0759        3.54e- 2       0.270 rsq    
## 3    11    35         13    0.0488        1.28e+ 1       0.244 rsq    
## 4    21    20          5    0.00920       2.13e- 7       0.997 rsq    
## 5    19    17         14    0.00303       1.97e-10       0.189 rsq    
## # … with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
## #   std_err <dbl>, .config <chr>

Or select the best performing model overall:

best_rsq <- select_best(xgb_res, "rsq")
best_rsq
## # A tibble: 1 × 7
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .config          
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>            
## 1    23    10          4     0.0218         0.0142       0.322 Preprocessor1_Mo…

Now we finalize the tuneable workflow with the parameter values we selected:

final_xgb <- finalize_workflow(
  xgb_wf,
  best_rsq
)

final_xgb
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## fantasy_points ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = 23
##   trees = 1000
##   min_n = 10
##   tree_depth = 4
##   learn_rate = 0.0217749924129286
##   loss_reduction = 0.0141520961395455
##   sample_size = 0.321661404399201
## 
## Computational engine: xgboost

Instead of tune() placeholders, the model's hyperparameters now have real values.

Which predictors matter most? Let's check variable importance.

library(vip)

final_xgb %>%
  fit(data = nba_train) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")

Looks like minutes has the most importance, followed by points and then field goals made.

The fitted booster can be saved with xgboost::xgb.save() and reloaded later with xgboost::xgb.load().

final_xgb %>%
  fit(data = nba_train) %>%
  extract_fit_engine() %>%
  xgboost::xgb.save(., here::here("models/normal_model"))
## [1] TRUE
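A quick round trip confirms the save/load cycle preserves the model; this sketch trains a throwaway two-round booster on random data rather than touching the real model file:

```r
library(xgboost)

set.seed(1)
toy <- xgb.DMatrix(data = matrix(rnorm(40), ncol = 4), label = rnorm(10))
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = toy, nrounds = 2)

# save to disk, load back, and compare predictions
path <- tempfile(fileext = ".model")
xgb.save(bst, path)
bst_reloaded <- xgb.load(path)

identical_preds <- isTRUE(all.equal(predict(bst, toy), predict(bst_reloaded, toy)))
```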

Finally, let's use last_fit() to fit the model on the full training data and evaluate it on the testing set. Notice that this is the first time we have used the testing data during the whole modeling analysis.

final_res <- last_fit(final_xgb, nba_split)

collect_metrics(final_res)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      34.5   Preprocessor1_Model1
## 2 rsq     standard       0.999 Preprocessor1_Model1

These metrics look almost too good, which makes sense: fantasy points are a deterministic linear combination of several of the predictors, so the model can essentially recover the scoring formula. Since this is regression, not classification, an ROC curve doesn't apply; instead we plot predicted versus true fantasy points for the testing set.

final_res %>%
  collect_predictions() %>%
  ggplot(aes(fantasy_points, .pred, color = id)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.2) +
  labs(x = "Truth", y = "Predicted Value",
       color = NULL,
       title = "Predicted and True Values for Fantasy Points")

But let's go ahead and look at some of the predictions for this year:

new_df <- readr::read_csv(here::here("Data/latest_season.csv")) %>%
  dplyr::select(!where(is.factor)) %>%
  rename(fgm_total = fgm_totals, fga_total = fga_totals)

fitted_wf <- pluck(final_res$.workflow, 1)

preds <- predict(fitted_wf, new_data = new_df) %>%
  as_vector()

pred_df <- new_df %>%
  bind_cols(predictions = preds)

Projections

Here are the current projections based on totals. The top of the list looks pretty reasonable, but lower down I'm seeing some difficulties, and I think the model's reliance on time played is really impacting its performance there.

pred_df %>%
  arrange(desc(predictions)) %>%
  dplyr::slice(c(1:100)) %>%
  ggplot(aes(fct_reorder(name_player, predictions), predictions)) +
  geom_col(color = "#3EB489", alpha = 0.8, fill = "#3EB489") +
  coord_flip() +
  labs(x = "Player", y = "Fantasy Points Prediction", title = "Top 100 Predictions Based on Season Totals") +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 7))

Copyright © Duncan Gates, 2021